1 Abstract 

Conventional ensemble learning combines students in the space domain. In this paper, by contrast, 
we combine students in the time domain and call this time domain ensemble learning. We analyze 
the generalization performance of time domain ensemble learning in the framework of online learning 
using a statistical mechanical method. We treat a model in which both the teacher and the student 
are linear perceptrons with noise. In this model, time domain ensemble learning is twice as effective as 
conventional space domain ensemble learning. 
2 Introduction 

Learning is the inference of the underlying rules that dominate data generation from observed data. Observed 
data are input-output pairs from a teacher and are called examples. Learning can be roughly classified 
into batch learning and online learning [1]. In batch learning, given examples are used more than once. In 
this paradigm, a student will give the correct answers after training if that student has adequate degrees 
of freedom. However, batch learning requires a long time and a large memory in which many examples 
are stored. In online learning, by contrast, examples are discarded after being used once. In this case, a 
student cannot give correct answers for all the examples used in training. However, online learning has 
merits: a large memory for storing many examples is not necessary, and it is possible to follow a 
time-variant teacher. 

Recently, we [2] analyzed the generalization performance of some models in a framework of online 
learning using a statistical mechanical method. Ensemble learning means combining many rules 
or learning machines (called students in this paper) that individually perform poorly; it has recently 
attracted the attention of many researchers. The diversity or variety of students is important in ensemble 
learning. We showed that three well-known rules, Hebbian learning, perceptron learning, and AdaTron 
learning, have different characteristics in their affinities for ensemble learning, that is, in "maintaining 
diversity among students". In that process, the following points were proven subsidiarily [12, 13, 14, 15]. 
The student vector does not converge in one direction but continues moving in an unlearnable 
case [10, 11]. Therefore, we also analyzed the generalization performance of a student supervised by a 
moving teacher that goes around a true teacher. As a result, it was proven that the generalization 
error of a student can be smaller than that of a moving teacher, even if the student only uses examples from the 
moving teacher. In an actual human society, a teacher observed by a student does not always present the 
correct answer. In many cases, the teacher is itself learning and continues to change. Therefore, the analysis 
of such a model is interesting for considering the analogies between statistical learning theories and an 
actual human society. 

In conventional ensemble learning, the generalization performance is improved by combining students 
that are diverse. On the other hand, a student does not always converge in one direction but may 
continue moving in an unlearnable model. Therefore, the generalization performance in such a model 
can be expected to improve by combining the student itself at different times, even if there is only one 
student [14, 15]. Conventional ensemble learning combines students in the space domain. In contrast, we 
introduce a method of combining the students in the time domain; we call this time domain ensemble 
learning. In this paper, we analyze the generalization performance of time domain ensemble learning 
using a statistical mechanical method. We treat a model in which both the teacher and the student are 
linear perceptrons [2] with noise. We obtain the order parameters and generalization errors analytically 
in a framework of online learning. 



3 Model 

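In this paper we consider a teacher and a student. They are linear perceptrons with the connection 
weights $\mathbf{B}$ and $\mathbf{J}^m$, respectively. Here, $m$ denotes the time step. For simplicity, the connection weights 
of the teacher and the student are simply called the teacher and the student, respectively. Teacher 
$\mathbf{B} = (B_1, \ldots, B_N)$, student $\mathbf{J}^m = (J_1^m, \ldots, J_N^m)$, and input $\mathbf{x}^m = (x_1^m, \ldots, x_N^m)$ are $N$-dimensional 
vectors. Each component $B_i$ of $\mathbf{B}$ is independently drawn from $\mathcal{N}(0, 1)$ and fixed, where $\mathcal{N}(0, 1)$ denotes 
a Gaussian distribution with a mean of zero and variance unity. Each component $J_i^0$ of the initial value 
$\mathbf{J}^0$ of $\mathbf{J}^m$ is independently drawn from $\mathcal{N}(0, 1)$. The direction cosine between $\mathbf{J}^m$ and $\mathbf{B}$ is $R^m$ and 
that between $\mathbf{J}^m$ and $\mathbf{J}^{m'}$ is $q^{m,m'}$. Each component $x_i^m$ of $\mathbf{x}^m$ is drawn from $\mathcal{N}(0, 1/N)$ independently. 
Thus, 

$$\langle B_i \rangle = 0, \quad \langle (B_i)^2 \rangle = 1, \tag{1}$$
$$\langle J_i^0 \rangle = 0, \quad \langle (J_i^0)^2 \rangle = 1, \tag{2}$$
$$\langle x_i^m \rangle = 0, \quad \langle (x_i^m)^2 \rangle = \frac{1}{N}, \tag{3}$$
$$R^m = \frac{\mathbf{B} \cdot \mathbf{J}^m}{\|\mathbf{B}\| \, \|\mathbf{J}^m\|}, \tag{4}$$
$$q^{m,m'} = \frac{\mathbf{J}^m \cdot \mathbf{J}^{m'}}{\|\mathbf{J}^m\| \, \|\mathbf{J}^{m'}\|}, \tag{5}$$

where $\langle \cdot \rangle$ denotes a mean. 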

Figure 1 illustrates the relationship among teacher $\mathbf{B}$, students $\mathbf{J}^m$ and $\mathbf{J}^{m'}$, and the direction cosines 
$R^m$, $R^{m'}$, and $q^{m,m'}$. 

Figure 1: Teacher $\mathbf{B}$ and students $\mathbf{J}^m$ and $\mathbf{J}^{m'}$. $R^m$, $R^{m'}$, and $q^{m,m'}$ are direction cosines. 



In this paper, the thermodynamic limit $N \to \infty$ is also treated. Therefore, 

$$\|\mathbf{B}\| = \sqrt{N}, \quad \|\mathbf{J}^0\| = \sqrt{N}, \quad \|\mathbf{x}^m\| = 1. \tag{6}$$

Generally, the norm $\|\mathbf{J}^m\|$ of the student changes as the time step proceeds. Therefore, the ratio $l^m$ of 
the norm to $\sqrt{N}$ is introduced and is called the length of the student. That is, $\|\mathbf{J}^m\| = l^m \sqrt{N}$. 



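Both the teacher and the student are linear perceptrons. Their outputs are $v^m + n_B^m$ and $u^m l^m + n_J^m$, 
respectively. Here, 

$$v^m = \mathbf{B} \cdot \mathbf{x}^m, \tag{7}$$
$$u^m l^m = \mathbf{J}^m \cdot \mathbf{x}^m, \tag{8}$$
$$n_B^m \sim \mathcal{N}(0, \sigma_B^2), \tag{9}$$
$$n_J^m \sim \mathcal{N}(0, \sigma_J^2), \tag{10}$$

where $\mathcal{N}(0, \sigma^2)$ denotes a Gaussian distribution with a mean of zero and variance $\sigma^2$. That is, the 
outputs of the teacher and the student include independent Gaussian noises with variances of $\sigma_B^2$ and $\sigma_J^2$, 
respectively. Then, $v^m$ and $u^m$ obey Gaussian distributions with a mean of zero and variance unity. 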

Let us define the error $\epsilon^m$ between the teacher $\mathbf{B}$ and the student $\mathbf{J}^m$ alone by the squared error of 
their outputs: 

$$\epsilon^m \equiv \frac{1}{2} \left( v^m + n_B^m - u^m l^m - n_J^m \right)^2. \tag{11}$$

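Student $\mathbf{J}^m$ adopts the gradient method as a learning rule and uses input $\mathbf{x}^m$ and an output of teacher 
$\mathbf{B}$ for updates. That is, 

$$\mathbf{J}^{m+1} = \mathbf{J}^m - \eta \frac{\partial \epsilon^m}{\partial \mathbf{J}^m} \tag{12}$$
$$\phantom{\mathbf{J}^{m+1}} = \mathbf{J}^m + \eta \left( v^m + n_B^m - u^m l^m - n_J^m \right) \mathbf{x}^m, \tag{13}$$

where $\eta$ denotes the learning rate of the student and is a constant. Generalizing the learning 
rule, eq. (13) can be expressed as 

$$\mathbf{J}^{m+1} = \mathbf{J}^m + f^m \mathbf{x}^m, \tag{14}$$

where $f$ denotes a function that represents the update amount and is determined by the learning rule. 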
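The update rule of eq. (13) can be tried out numerically. The following is a minimal sketch, not the authors' simulation code: the function name `simulate_online_learning`, the default sizes, and the use of Python's standard `random` module are illustrative assumptions. It draws the teacher, the initial student, and the inputs as described in this section, applies eq. (13) for `t_max * N` steps, and returns the student length and the direction cosine with the teacher.

```python
import math
import random


def simulate_online_learning(N=200, eta=1.0, sigma_B2=0.0, sigma_J2=0.0,
                             t_max=10.0, seed=0):
    """Run the gradient update of eq. (13) and return (l, R) at time t_max.

    N, seed and the function name are illustrative choices, not from the paper.
    """
    rng = random.Random(seed)
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]   # teacher, components ~ N(0, 1)
    J = [rng.gauss(0.0, 1.0) for _ in range(N)]   # initial student, ~ N(0, 1)
    x_std = 1.0 / math.sqrt(N)                    # input components ~ N(0, 1/N)

    for _ in range(int(t_max * N)):               # time t = m / N
        x = [rng.gauss(0.0, x_std) for _ in range(N)]
        v = sum(b * xi for b, xi in zip(B, x))    # teacher field, eq. (7)
        ul = sum(j * xi for j, xi in zip(J, x))   # student field u*l, eq. (8)
        n_B = rng.gauss(0.0, math.sqrt(sigma_B2)) # teacher output noise
        n_J = rng.gauss(0.0, math.sqrt(sigma_J2)) # student output noise
        f = eta * (v + n_B - ul - n_J)            # update amount f of eq. (14)
        J = [j + f * xi for j, xi in zip(J, x)]

    norm_B = math.sqrt(sum(b * b for b in B))
    norm_J = math.sqrt(sum(j * j for j in J))
    length = norm_J / math.sqrt(N)                # student length l
    R = sum(b * j for b, j in zip(B, J)) / (norm_B * norm_J)
    return length, R


if __name__ == "__main__":
    print(simulate_online_learning())
```

With a finite N the returned values fluctuate around the thermodynamic-limit predictions, so a small N such as the default here is only suitable for a quick qualitative check.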

4 Theory 

4.1 Generalization error 

Ensemble learning means improving performance by combining many students that individually perform poorly. In 
this paper, by contrast, we use just one student and combine its copies (hereafter called 'brothers') at different 
time steps. Conventional ensemble learning combines students in the space domain, whereas 
we combine students in the time domain. In this paper $K$ brothers $\mathbf{J}^{m_1}, \mathbf{J}^{m_2}, \ldots, \mathbf{J}^{m_K}$ are 
combined. Here, $m_1 \le m_2 \le \ldots \le m_K$. We use the squared error $\epsilon$ for a new input $\mathbf{x}$. Here, it is assumed 
that the Gaussian noises of eqs. (9) and (10) are independently added to the teacher and each brother of 
the ensemble, respectively. The weight $C_k$ of each brother $\mathbf{J}^{m_k}$ of the ensemble satisfies $C_k > 0$. That is, 
the error of the ensemble is 

$$\epsilon \equiv \frac{1}{2} \left[ \mathbf{B} \cdot \mathbf{x} + n_B - \sum_{k=1}^{K} C_k \left( \mathbf{J}^{m_k} \cdot \mathbf{x} + n_k \right) \right]^2. \tag{15}$$

Here, $n_B \sim \mathcal{N}(0, \sigma_B^2)$ and $n_k \sim \mathcal{N}(0, \sigma_J^2)$. 

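A goal of statistical learning theory is to theoretically obtain generalization errors. Since a generalization 
error is the mean of errors over the distribution of the new input $\mathbf{x}$ and noises $n_B, n_k$, $k = 1, \ldots, K$, 
the generalization error $\epsilon_g$ of the ensemble is calculated as follows: 

\begin{align}
\epsilon_g &= \int d\mathbf{x}\, dn_B \left( \prod_{k=1}^{K} dn_k \right) p(\mathbf{x})\, p(n_B) \left( \prod_{k=1}^{K} p(n_k) \right) \epsilon \tag{16} \\
&= \int dv \left( \prod_{k=1}^{K} du_k \right) dn_B \left( \prod_{k=1}^{K} dn_k \right) p(v, \{u_k\})\, p(n_B) \left( \prod_{k=1}^{K} p(n_k) \right) \frac{1}{2} \left[ v + n_B - \sum_{k=1}^{K} C_k \left( u_k l^{m_k} + n_k \right) \right]^2 \tag{17} \\
&= \frac{1}{2} \left[ 1 - 2 \sum_{k=1}^{K} C_k l^{m_k} R^{m_k} + 2 \sum_{k=1}^{K} \sum_{k'>k} C_k C_{k'} l^{m_k} l^{m_{k'}} q^{m_k, m_{k'}} + \sum_{k=1}^{K} C_k^2 (l^{m_k})^2 + \sigma_B^2 + \sum_{k=1}^{K} C_k^2 \sigma_J^2 \right], \tag{18}
\end{align}

where $v = \mathbf{B} \cdot \mathbf{x}$ and $u_k l^{m_k} = \mathbf{J}^{m_k} \cdot \mathbf{x}$. We executed the integration using the following: $v$ and $u_k$ obey 
$\mathcal{N}(0, 1)$. The covariance between $v$ and $u_k$ is $R^{m_k}$, and that between $u_k$ and $u_{k'}$ is $q^{m_k, m_{k'}}$. $n_B$ and $n_k$ are 
independent of the other random variables. 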
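As a hedged illustration of how eq. (18) is evaluated once the order parameters are known, the sketch below computes the ensemble generalization error from the weights, lengths, and direction cosines. The function name `ensemble_generalization_error` and its argument layout (plain lists, with `q` as a K-by-K nested list) are assumptions made for this example; the formula coded is the average of the squared error of eq. (15) over $v$, $u_k$, $n_B$, and $n_k$ with the covariances stated above.

```python
def ensemble_generalization_error(C, l, R, q, sigma_B2, sigma_J2):
    """Average of the squared error of eq. (15) over input and noise.

    C[k]    : weight of brother k
    l[k]    : length of brother k
    R[k]    : direction cosine between brother k and the teacher
    q[k][j] : direction cosine between brothers k and j (only j > k is read)
    """
    K = len(C)
    total = 1.0 + sigma_B2                      # <v^2> + <n_B^2>
    for k in range(K):
        total -= 2.0 * C[k] * l[k] * R[k]       # teacher-brother cross term
        total += C[k] ** 2 * (l[k] ** 2 + sigma_J2)  # brother self term + its noise
        for kp in range(k + 1, K):
            # brother-brother cross term, counted once per pair with factor 2
            total += 2.0 * C[k] * C[kp] * l[k] * l[kp] * q[k][kp]
    return 0.5 * total


if __name__ == "__main__":
    # Two identical, perfectly aligned brothers with noisy outputs.
    print(ensemble_generalization_error(
        [0.5, 0.5], [1.0, 1.0], [1.0, 1.0], [[1.0, 1.0], [1.0, 1.0]], 0.2, 0.2))
```

In the usage example both brothers coincide with the teacher, so the only remaining error is noise: the teacher noise contributes in full and the brothers' independent noises are averaged down by the weights.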



4.2 Differential equations for order parameters and their analytical solutions 

In this paper, we examine the thermodynamic limit $N \to \infty$. Therefore, updates by eq. (13) or (14) must 
be executed $O(N)$ times for the order parameters $l$, $R$, and $q$ to change by $O(1)$. Thus, the continuous 
times $t_1, \ldots, t_K$, which are the time steps $m_1, \ldots, m_K$ normalized by the dimension $N$, are introduced as 
the superscripts that stand for the learning process. To simplify the analysis, we introduce the following 
auxiliary order parameters: 

$$r^t \equiv l^t R^t, \tag{19}$$
$$Q^{t,t'} \equiv l^t l^{t'} q^{t,t'}. \tag{20}$$

The simultaneous differential equations in deterministic forms [16], which describe the dynamical 
behaviors of the order parameters, have been obtained based on self-averaging in the thermodynamic limit 
as follows: 

$$\frac{dl^t}{dt} = \langle f^t u^t \rangle + \frac{\langle (f^t)^2 \rangle}{2 l^t}, \tag{21}$$
$$\frac{dr^t}{dt} = \langle f^t v^t \rangle, \tag{22}$$
$$\frac{dQ^{t,t'}}{dt'} = l^t \langle f^{t'} \bar{u}^t \rangle, \tag{23}$$

where $t' \ge t$ and $\bar{u}^t \equiv \mathbf{x}^{t'} \cdot \mathbf{J}^t / l^t \sim \mathcal{N}(0, 1)$. 

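Since linear perceptrons are used in this paper, the sample averages that appear in the above equations 
can be easily calculated as follows: 

$$\langle f^t u^t \rangle = \eta \left( r^t / l^t - l^t \right), \tag{24}$$
$$\langle (f^t)^2 \rangle = \eta^2 \left( 1 + \sigma_B^2 + \sigma_J^2 - 2 r^t + (l^t)^2 \right), \tag{25}$$
$$\langle f^t v^t \rangle = \eta \left( 1 - r^t \right), \tag{26}$$
$$\langle f^{t'} \bar{u}^t \rangle = \eta \left( r^t - Q^{t,t'} \right) / l^t. \tag{27}$$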



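Since all components of teacher $\mathbf{B}$ and the initial student $\mathbf{J}^0$ are independently drawn from $\mathcal{N}(0, 1)$ 
and because the thermodynamic limit $N \to \infty$ is also used, they are orthogonal to each other in the 
initial state. That is, 

$$R^0 = 0. \tag{28}$$

In addition, 

$$l^0 = 1 \tag{29}$$

and 

$$Q^{t,t} = (l^t)^2 \tag{30}$$

using eqs. (6) and (20). Using these initial conditions, we can analytically solve the simultaneous 
differential equations (21) - (23) as follows: 

$$r^t = 1 - e^{-\eta t}, \tag{31}$$
$$(l^t)^2 = 1 + \frac{\eta}{2-\eta} (\sigma_B^2 + \sigma_J^2) - 2 e^{-\eta t} + \left( 2 - \frac{\eta}{2-\eta} (\sigma_B^2 + \sigma_J^2) \right) e^{\eta(\eta-2)t}, \tag{32}$$
$$Q^{t,t'} = r^t + \left( (l^t)^2 - r^t \right) e^{-\eta (t' - t)}. \tag{33}$$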
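Substituting eqs. (31) - (33) into eq. (18), the generalization error $\epsilon_g$ can be analytically obtained as a 
function of time $t_k$, $k = 1, \ldots, K$, as follows: 

\begin{align}
\epsilon_g = \frac{1}{2} \Biggl[ & 1 - 2 \sum_{k=1}^{K} C_k \left( 1 - e^{-\eta t_k} \right) + 2 \sum_{k=1}^{K} \sum_{k'>k} C_k C_{k'} \left( 1 - e^{-\eta t_k} + \left( (l^{t_k})^2 - 1 + e^{-\eta t_k} \right) e^{-\eta (t_{k'} - t_k)} \right) \tag{34} \\
& + \sum_{k=1}^{K} C_k^2 (l^{t_k})^2 + \sigma_B^2 + \sum_{k=1}^{K} C_k^2 \sigma_J^2 \Biggr], \tag{35}
\end{align}

where $(l^{t_k})^2 = 1 + \frac{\eta}{2-\eta} (\sigma_B^2 + \sigma_J^2) - 2 e^{-\eta t_k} + \left( 2 - \frac{\eta}{2-\eta} (\sigma_B^2 + \sigma_J^2) \right) e^{\eta(\eta-2) t_k}$. 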



5 Results and Discussion 



The dynamical behaviors of $l^t$ and $R^t$ have been analytically obtained using eqs. (19), (31), and (32). 
Figures 2 and 3 show some examples of the analytical results and the corresponding simulation results, 
where $N = 2000$. In these figures, the curves represent theoretical results and the symbols represent 
simulation results. Figure 2 shows the results for $\sigma_B^2 = \sigma_J^2 = 0.0$, that is, no noise. Figure 3 shows the 
results for $\sigma_B^2 = \sigma_J^2 = 0.2$. 

Focusing on the signs of the powers of the exponential functions in eq. (32), we can see that $l^t$ diverges 
if the learning rate satisfies $\eta < 0$ or $\eta > 2$. $l^t$ converges to 

$$l^\infty = \sqrt{1 + \frac{\eta}{2-\eta} (\sigma_B^2 + \sigma_J^2)} \tag{36}$$

if $0 < \eta < 2$. Equations (31) and (36) show that $R^t = r^t / l^t$ converges to 

$$R^\infty = \frac{1}{l^\infty} = \sqrt{\frac{2-\eta}{2 - \eta + \eta (\sigma_B^2 + \sigma_J^2)}}. \tag{37}$$

Therefore, we can see that $l^\infty = R^\infty = 1$ in the case of no noise and $l^\infty > 1$, $R^\infty < 1$ in the case of 
noise. 

Since eq. (32) shows 

$$\frac{d(l^t)^2}{dt} = 2 \eta e^{-\eta t} - \eta \left( 4 - \eta (2 + \sigma_B^2 + \sigma_J^2) \right) e^{\eta(\eta-2)t}, \tag{38}$$

an equation regarding $t$, 

$$\frac{d(l^t)^2}{dt} = 0, \tag{39}$$

has only one solution 

$$t = \frac{1}{\eta (1-\eta)} \ln \frac{4 - \eta (2 + \sigma_B^2 + \sigma_J^2)}{2} \tag{40}$$

if the learning rate satisfies $0 < \eta < 4/(2 + \sigma_B^2 + \sigma_J^2)$ and $\eta \ne 1$. In the noiseless case, eq. (38) gives 
$d(l^t)^2/dt = 2\eta(\eta - 1)$ at $t = 0$, where $l^0 = 1$. Therefore, $l^t$ asymptotically approaches 
unity after becoming smaller than unity if $0 < \eta < 1$, and $l^t$ asymptotically approaches unity after becoming 
larger than unity if $1 < \eta < 2$, as shown in Figure 2(a). 
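The behavior of the student length described above can be checked numerically. The sketch below is an assumption-laden illustration rather than the authors' code: the closed forms for $r^t$ and $(l^t)^2$ are the ones obtained by integrating the length and overlap dynamics of the update rule (13) from $l^0 = 1$, $R^0 = 0$, and the names `r_theory`, `l2_theory`, `dl2_dt`, and `extremum_time` are hypothetical. The argument `s2` stands for the total noise variance $\sigma_B^2 + \sigma_J^2$.

```python
import math


def r_theory(t, eta):
    """Overlap r^t = l^t R^t, starting from R^0 = 0."""
    return 1.0 - math.exp(-eta * t)


def l2_theory(t, eta, s2):
    """Squared student length (l^t)^2 for 0 < eta < 2, starting from l^0 = 1."""
    c = eta * s2 / (2.0 - eta)
    return (1.0 + c - 2.0 * math.exp(-eta * t)
            + (2.0 - c) * math.exp(eta * (eta - 2.0) * t))


def dl2_dt(t, eta, s2):
    """Right-hand side of the length dynamics, d(l^2)/dt = 2 l <f u> + <f^2>."""
    l2 = l2_theory(t, eta, s2)
    r = r_theory(t, eta)
    return eta * (eta - 2.0) * l2 + 2.0 * eta * (1.0 - eta) * r + eta ** 2 * (1.0 + s2)


def extremum_time(eta, s2):
    """Time at which d(l^2)/dt = 0; needs 0 < eta < 4/(2 + s2) and eta != 1."""
    d = 4.0 - eta * (2.0 + s2)
    if d <= 0.0 or eta == 1.0:
        raise ValueError("no single finite extremum for this learning rate")
    return math.log(d / 2.0) / (eta * (1.0 - eta))


if __name__ == "__main__":
    for eta in (0.2, 1.0, 1.8):
        print(eta, [round(math.sqrt(l2_theory(t, eta, 0.0)), 3) for t in (0, 1, 2, 5, 20)])
```

Printing the noiseless trajectories for a small, unit, and large learning rate shows the dip below unity, the constant length, and the overshoot above unity, respectively.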
(a) Dynamical behaviors of $l^t$. (b) Dynamical behaviors of $R^t$. 

Figure 2: Dynamical behaviors of $l^t$ and $R^t$. Theory and computer simulations. $\sigma_B^2 = \sigma_J^2 = 0.0$. 

Equations (19), (31), and (32) show 

$$\left. \frac{dR^t}{dt} \right|_{t=0} = \eta. \tag{41}$$

Therefore, the larger $\eta$ is, the faster $R^t$ rises, as shown in Figs. 2(b) and 3(b). However, eqs. (19), (31), 
(32), (36), and (37) show 



- R* 



1 r* 



(42) 
(43) 



(a) Dynamical behaviors of $l^t$. $\eta = 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.6, 0.4$ and $0.2$ from the top. 
(b) Dynamical behaviors of $R^t$. 

Figure 3: Dynamical behaviors of $l^t$ and $R^t$. Theory and computer simulations. $\sigma_B^2 = \sigma_J^2 = 0.2$. 



when $t$ is large. Since eq. (43) is $O(e^{-\eta t})$ if $0 < \eta < 1$ and $O(e^{\eta(\eta-2)t})$ if $1 < \eta < 2$, 
the convergence of $R^t$ is the fastest when the learning rate satisfies $\eta = 1$. This can be confirmed in 
Figure 2(b) and Figure 3(b). This phenomenon can be understood by the fact that $\eta = 1$ is a special 
condition where the student uses up the information obtained from input $\mathbf{x}^m$. 
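The sense in which $\eta = 1$ "uses up" the information in $\mathbf{x}^m$ can be made explicit with a short worked step. This is a sketch that follows directly from the update rule (13) and the normalization $\|\mathbf{x}^m\| = 1$ of eq. (6); it is offered as an illustration and is not taken from the original text. Taking the inner product of eq. (13) with $\mathbf{x}^m$ gives the student's noiseless response to the same input after the update:

```latex
\begin{align*}
\mathbf{J}^{m+1} \cdot \mathbf{x}^m
  &= \mathbf{J}^m \cdot \mathbf{x}^m
     + \eta \left( v^m + n_B^m - u^m l^m - n_J^m \right) \|\mathbf{x}^m\|^2 \\
  &= (1 - \eta)\, u^m l^m + \eta \left( v^m + n_B^m - n_J^m \right).
\end{align*}
```

At $\eta = 1$ the old response $u^m l^m$ is overwritten completely by the target $v^m + n_B^m - n_J^m$. For $\eta < 1$ part of the old response remains, and for $\eta > 1$ the update overshoots the target.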

We analytically obtained the dynamical behaviors of the generalization error $\epsilon_g$ and the direction 
cosine $q$ using eqs. (20) and (32) - (34). Figures 4 and 5 show some examples of the analytical results and 
the corresponding simulation results, where $N = 2000$. In these figures, the curves represent theoretical 
results and the symbols represent simulation results. $\epsilon_g$ has been calculated for the simplest case, that is, 
$K = 2$, $C_1 = C_2 = 1/2$. Other conditions are $\eta = 1.0$ and $\sigma_B^2 = \sigma_J^2 = 0.2$. In the computer simulations, 
$\epsilon_g$ was obtained by averaging the squared errors for 10** random inputs at each time step. 

Figure 4 shows the relationship between $t_2 - t_1$ and $\epsilon_g$, $q^{t_1,t_2}$ in the case of a constant $t_1$. When 
$t_2 - t_1$ increases, $\epsilon_g$ increases monotonically, remains constant, or decreases monotonically depending on 
the value of $\eta$. We prove this in the following. Equation (34) shows that $\epsilon_g(K=1)$, with $C_1 = 1$, is 

$$\epsilon_g(K=1) = \frac{\sigma_B^2 + \sigma_J^2}{2-\eta} + \left( 1 - \frac{\eta (\sigma_B^2 + \sigma_J^2)}{2(2-\eta)} \right) e^{\eta(\eta-2)t}. \tag{44}$$

Therefore, $\epsilon_g(K=1)$ decreases monotonically, remains constant, or increases monotonically as time $t$ 
proceeds. The necessary and sufficient conditions for the above three phenomena are 

$$0 < \eta < \frac{4}{2 + \sigma_B^2 + \sigma_J^2}, \tag{45}$$
$$\eta = \frac{4}{2 + \sigma_B^2 + \sigma_J^2}, \tag{46}$$
$$\frac{4}{2 + \sigma_B^2 + \sigma_J^2} < \eta < 2, \tag{47}$$

respectively. Since the output of the ensemble is the weighted sum of the outputs of the brothers, the 
generalization error for $K > 1$ also decreases monotonically, remains constant, or increases monotonically. 
The necessary and sufficient conditions for these three phenomena are also given by eqs. (45) - (47). Since 
the condition of Fig. 4(a) agrees with eq. (45), the generalization error decreases monotonically. Equations 
(20), (31), (32), and (33) show that $q^{t_1,t_2}$ in the case of $t_1 = 0$ asymptotically approaches zero when 
$t_2 - t_1 \to \infty$, as shown in Fig. 4(b). This means that after a long time the student is orthogonal to its 
initial condition. 

Since the order parameters and $\epsilon_g$ have been explicitly obtained as functions of $t$ and $t'$ in eqs. (31) - 
(34) in this paper, the relationships between $t_1$ and $\epsilon_g$, $q^{t_1,t_2}$ in the case of a constant time interval of the 
brothers, that is, constant $t_{k+1} - t_k$, can be calculated. Figure 5 shows the relationship between $t_1$ and $\epsilon_g$, 
$q^{t_1,t_2}$ in the case of constant $t_2 - t_1$. For the same reason as in Fig. 4(a), the generalization error $\epsilon_g$ also 
decreases monotonically in Fig. 5(a). Figure 5(b) shows that $q^{t_1,t_2}$ converges to a value smaller than 
unity in the case of $t_2 - t_1 \ne 0.0$. This means that the student itself continues to move after the order 
parameters $l$, $R$, and $q$ reach a steady state. 

In Figs. 4 and 5, the generalization error $\epsilon_g$ and the direction cosine $q^{t_1,t_2}$ seem to reach an almost 
steady state by $t_2 - t_1 > 5$ or $t_1 > 5$. The behaviors of $\epsilon_g$ when the leading time $t_1 \to \infty$ or the 
time interval $t_{k+1} - t_k \to \infty$ can be theoretically obtained since the generalization error and the order 
parameters have been analytically obtained as functions of $t_k$, $k = 1, \ldots, K$, as shown in eq. (34). 

Figure 4: Relationship between $t_2 - t_1$ and $\epsilon_g$, $q^{t_1,t_2}$ in the case of constant leading time $t_1$. Theory and 
computer simulations. $\eta = 1.0$, $\sigma_B^2 = \sigma_J^2 = 0.2$. 

Figure 5: Relationship between $t_1$ and $\epsilon_g$, $q^{t_1,t_2}$ in the case of constant time interval $t_2 - t_1$. Theory and 
computer simulations. $\eta = 1.0$, $\sigma_B^2 = \sigma_J^2 = 0.2$. 

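First, eqs. (32) and (34) show that $(l^t)^2$ diverges unless $0 < \eta < 2$. Therefore, the generalization 
error diverges unless $0 < \eta < 2$. If $0 < \eta < 2$, the generalization error can be discussed as follows: 

When $t_1 \to \infty$, from eqs. (31) - (34) we obtain 

$$\epsilon_g = \frac{1}{2} \left[ 1 - 2 \sum_{k=1}^{K} C_k + 2 \sum_{k=1}^{K} \sum_{k'>k} C_k C_{k'} \left( 1 + \frac{\eta (\sigma_B^2 + \sigma_J^2)}{2-\eta} e^{-\eta (t_{k'} - t_k)} \right) + \sum_{k=1}^{K} C_k^2 \left( 1 + \frac{\eta (\sigma_B^2 + \sigma_J^2)}{2-\eta} \right) + \sigma_B^2 + \sum_{k=1}^{K} C_k^2 \sigma_J^2 \right]. \tag{48}$$

In addition, when the time interval $t_{k+1} - t_k \to \infty$, from eq. (48), we obtain 

$$\epsilon_g = \frac{1}{2} \left[ 1 - 2 \sum_{k=1}^{K} C_k + 2 \sum_{k=1}^{K} \sum_{k'>k} C_k C_{k'} + \sum_{k=1}^{K} C_k^2 \left( 1 + \frac{\eta (\sigma_B^2 + \sigma_J^2)}{2-\eta} \right) + \sigma_B^2 + \sum_{k=1}^{K} C_k^2 \sigma_J^2 \right]. \tag{49}$$

Equation (49) shows that the generalization error decreases as the learning rate $\eta$ decreases regardless 
of $K$ when $t_1 \to \infty$ and $t_{k+1} - t_k \to \infty$. 

In addition, when the weights are uniform, that is, $C_k = C = 1/K$, from eq. (49), we obtain 

$$\epsilon_g = \frac{1}{2} \left[ \frac{2 \sigma_J^2 + \eta \sigma_B^2}{(2-\eta) K} + \sigma_B^2 \right]. \tag{50}$$

Here, considering the special case $K = 1$, we obtain 

$$\epsilon_g = \frac{\sigma_B^2 + \sigma_J^2}{2-\eta}. \tag{51}$$

If $\mathbf{B} = \mathbf{J}^{t_1}$ is true, the generalization error must equal the residual error 

$$\frac{1}{2} \left( \sigma_B^2 + \sigma_J^2 \right) \tag{52}$$

caused by noise from eq. (15), which is the definition of error. Therefore, the difference between eq. (51) 
and eq. (52), 

$$\frac{\eta (\sigma_B^2 + \sigma_J^2)}{2 (2-\eta)}, \tag{53}$$

is caused by the disagreement between $\mathbf{B}$ and $\mathbf{J}^{t_1}$. 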

Next, let us consider another special case, $K = \infty$. If and only if 

$$\mathbf{B} = \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} \mathbf{J}^{t_k}, \tag{54}$$

the generalization error must equal the residual error 

$$\frac{1}{2} \sigma_B^2 \tag{55}$$

caused by noise from eq. (15), which is the definition of error. Equation (54) is true since eq. (50) equals 
eq. (55) when $K = \infty$. 

In addition, if $\sigma_B^2 = \sigma_J^2 = \sigma^2$, eq. (50) changes as follows: 

$$\epsilon_g = \frac{\sigma^2}{2} \left[ 1 + \frac{2 + \eta}{(2-\eta) K} \right]. \tag{56}$$



The relationship between the learning rate $\eta$ and the generalization error $\epsilon_g$ can be analytically 
obtained using eq. (56) when both the leading time $t_1$ and the time interval $t_{k+1} - t_k$ are large enough, 
with uniform weights $C_k = C = 1/K$ and $\sigma_B^2 = \sigma_J^2 = 0.5$. Figure 6 shows the analytical results 
and the corresponding simulation results. In the computer simulations, $N = 2000$, the leading time 
$t_1 = 10$, and the time interval $t_{k+1} - t_k = 10$ ($t_1 = t_{k+1} - t_k = 20$ when $\eta = 0.2$); $\epsilon_g$ was obtained by 
averaging the squared errors for 10^ random inputs at each time step. Figure 6 confirms the following. 
The generalization error decreases as the learning rate $\eta$ decreases. The generalization error decreases 
and converges to the residual error as $K$ increases. 

Figure 6: Relationship between the learning rate $\eta$ and the generalization error $\epsilon_g$ when both the leading 
time $t_1$ and the time interval $t_{k+1} - t_k$ are large enough. Theory and computer simulations. Conditions 
are $C_k = C = 1/K$ and $\sigma_B^2 = \sigma_J^2 = 0.5$. 



In addition, if the learning rate satisfies $\eta = 1$, eq. (56) becomes 

$$\epsilon_g = \frac{\sigma^2}{2} \left( 1 + \frac{3}{K} \right). \tag{57}$$

Equation (57) shows that the generalization error $\epsilon_g$ of $K = \infty$ is 1/4 of that of $K = 1$ when 
the learning rate satisfies $\eta = 1$, the weights are uniform, $C_k = 1/K$, $\sigma_B^2 = \sigma_J^2$, $t_1 \to \infty$, and $t_{k+1} - t_k \to \infty$. 
Since the generalization error $\epsilon_g$ of conventional space domain ensemble learning with $K = \infty$, $\eta = 1$, 
$C_k = 1/K$ and $\sigma_B^2 = \sigma_J^2$ is 1/2 of that of $K = 1$, we can say that time domain ensemble 
learning is twice as effective as conventional space domain ensemble learning. We can explain this 
difference as follows: In conventional space domain ensemble learning, the similarities among students 
become high since all students use the same examples for learning. On the other hand, in time domain 
ensemble learning, the similarities among brothers become low since the brothers use almost totally 
different examples for learning. 
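The factor of 1/4 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `asymptotic_eg` is hypothetical, and the formula coded is the long-time, widely-spaced limit with uniform weights $C_k = 1/K$, obtained by setting $r = 1$, $q^{t_k,t_{k'}} = 1/(l^\infty)^2$ and $(l^\infty)^2 = 1 + \eta(\sigma_B^2 + \sigma_J^2)/(2-\eta)$ in eq. (18). The space domain value of 1/2 quoted in the text is not recomputed here.

```python
def asymptotic_eg(eta, K, sigma_B2, sigma_J2):
    """Generalization error for t_1 -> inf, t_{k+1} - t_k -> inf, C_k = 1/K.

    Valid for 0 < eta < 2; the closed form is an assumption-based
    rederivation, not copied from the source.
    """
    if not 0.0 < eta < 2.0:
        raise ValueError("the student length diverges unless 0 < eta < 2")
    return 0.5 * ((2.0 * sigma_J2 + eta * sigma_B2) / ((2.0 - eta) * K) + sigma_B2)


if __name__ == "__main__":
    single = asymptotic_eg(1.0, 1, 0.5, 0.5)
    for K in (1, 2, 10, 100, 10 ** 6):
        print(K, asymptotic_eg(1.0, K, 0.5, 0.5) / single)
```

The printed ratio starts at 1 for a single brother and approaches 0.25 as the number of brothers grows, matching the comparison made above for $\eta = 1$ and equal noise variances.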

6 Conclusion 

We analyzed the generalization performance of time domain ensemble learning in the framework of 
online learning using a statistical mechanical method. We treated a model in which both the teacher and 
the student were linear perceptrons with noise. We showed that, in this model, time domain ensemble 
learning is twice as effective as conventional space domain ensemble learning. 